---
layout: page
title: Script 3 - Bivariate Regression
permalink: /scripts/script3/
parent: R Scripts
nav_order: 3
---

Script 3: Bivariate Regression

Run the code in this entire script in your own local R script - it is crucial to run the code yourself to familiarise yourself with the language. For your own sake, I strongly recommend that you always annotate your code, including the output or result of each command, unless it is trivial or already discussed below. Always be explicit about output and substantive results when responding to the tasks!



Main Commands Used in this Session

cov() # to calculate the covariance
cor() # to calculate the correlation
dim() # shows the dimensions of a data frame
lm() # linear model (regression)
plot() # to plot how observations of two variables relate to each other
subset() # to extract a subset of observations from a data frame
abline() # to add a straight line to a plot
scatterplot() # from the car package; creates a plot that also shows non-linear fits
summary() # to summarise the variables of a data set
boxplot() # to create a boxplot
hist() # to look at how a variable is distributed across observations as a histogram
densityPlot() # from the car package; creates a density plot to check a variable's distribution
stargazer() # from the stargazer package; to print publishable tables

In this script, we will focus on working with observational data and predicting patterns in data. Until now you have been describing patterns.

After this lab you will be able to predict the value of a dependent variable (Y) with an independent variable (X). In other words, we assess the existence of a relationship between the two variables.

Once we have established the relationship, we will be able to make estimates - that is, predict values of Y for given values of X, even values of X we have not observed. We will first go through what the regression equation actually means and then apply it to a real dataset on the Brexit vote.


Regression: Ordinary Least Squares (OLS)

The simplest relationship between an independent and a dependent variable is a straight line. Such a regular linear regression equation is expressed as follows:

\[Y=\alpha+\beta X\]

\(\alpha\) is referred to as the constant, or intercept, and denotes the point where the regression line ‘intercepts’ the Y axis. In other words, it estimates the expected value of Y when X equals zero.

\(\beta\) is the coefficient, or slope, and estimates the average change in Y associated with a unit change in X.

Using this function, it is possible to predict Y if you know X.

For example, you may or may not be familiar with the following equation: \(Y= 32 + 1.8X\). Here, the temperature in Fahrenheit is an exact linear function of temperature in Celsius. This relationship is deterministic. However, in the social sciences relationships between variables are almost always inexact, i.e., it might be likely for them to take on certain values, but we can’t say for sure. Therefore, the regression equation is more realistically written as:
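As a quick illustration, the deterministic conversion can be checked in R (the object names below are just for illustration):

```r
celsius <- c(0, 20, 100)           # a few example temperatures in Celsius
fahrenheit <- 32 + 1.8 * celsius   # apply Y = 32 + 1.8X to every value
fahrenheit                         # 32, 68, 212 - no error term involved
```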

\[Y_i=\alpha+\beta X_i + \epsilon_i\] where \(\epsilon\) represents the error term. This relationship is probabilistic, allowing variability in the values of Y for each value of X.
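To build intuition for the error term, we can simulate data from such a probabilistic relationship - all numbers below (intercept 3, slope 2, error standard deviation 2) are made up for illustration:

```r
set.seed(42)                        # for reproducibility
x <- runif(100, 0, 10)              # 100 hypothetical X values
y <- 3 + 2 * x + rnorm(100, sd = 2) # Y = alpha + beta*X + random error
plot(x, y)                          # points scatter around the line, not on it
abline(a = 3, b = 2)                # the 'true' line used in the simulation
```

Because of the error term, re-running the simulation with a different seed produces a different scatter around the same underlying line.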

To find the ‘best’ line - that is, a line with the smallest errors - we need to minimize the sum of squared errors:

\[SSE = \sum(Y_i-\hat{Y}_i)^2\]

where \(Y_i\) is the actual value and \(\hat{Y}_i\) is the predicted value for each unit. The total sum of the errors would include both positive and negative numbers that cancel each other out; squaring each error overcomes this problem of opposite signs. The technique that minimizes the sum of squared errors for linear regression is known as Ordinary Least Squares (OLS) estimation.
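To see concretely what the SSE measures, here is a small sketch with made-up numbers; the candidate line \(Y = 1 + 1.7X\) is arbitrary, chosen only for illustration:

```r
x <- c(1, 2, 3, 4)
y <- c(2.1, 3.9, 6.2, 7.8)
y_hat <- 1 + 1.7 * x        # predictions from the candidate line
sum((y - y_hat)^2)          # SSE of the candidate line: 0.62
# the OLS line minimizes the SSE by construction:
ols <- lm(y ~ x)
sum(resid(ols)^2)           # never larger than the SSE of any other line
```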

In R, the general formula for running a simple linear regression is

lm(dependent variable ~ independent variable, data = ...)

Working with Observational Data

Next, we apply the technique of OLS regressions to try to understand what the driving factors were behind the Leave vote in the 2016 Brexit referendum. It is always a good idea to start your research by eyeballing the data and getting some basic information on it. But before moving on, let’s set the working directory and clear our workspace.

#We can set the working directory 
setwd("~/Desktop/DataAnalysis/script3") 

#clear our workspace
rm(list = ls())

This time, we are also going to install a package to help us plot results later on.

Packages include commands and functions that are not included in base R and often have been created by researchers to serve their particular needs. It’s important to know how you can install packages on your own.

#we download a package called "Companion to Applied Regression"
#it helps us to create some of the graphs we are going to use

install.packages("car") #here we install the package, note that quotation marks matter!
library(car) #next we load it. You have to repeat this step after re-launching RStudio.

Make sure you have downloaded the referendum data and moved it to the folder you set as working directory. Next, we load the dataset and check it using the View() function.

brexit_data <- read.csv("brexit_data.csv") # we use read csv since we are using a csv file
View(brexit_data) # we view the data in a new window

# If you can't remember your file name, or can't find the data
# you can always use the following command to manually locate a file
brexit_data <- read.csv(file.choose())

To inspect the data further we can use summary().

#first we summarize our data
summary(brexit_data)

Each observation of the brexit_data dataset contains information about how each of the 382 regional units (called “local authorities” in the UK) in England, Scotland and Wales voted, along with some key demographic information about the unit.

The key variables are:

| Variable Name | Variable Description |
|---------------|----------------------|
| Area | The name of the electoral unit |
| Region | Name of the region a unit belongs to. Note: a region contains several units |
| Percent_Turnout | The percentage of people who voted |
| Percent_Remain | Percentage of people who voted Remain |
| Percent_Leave | Percentage of people who voted Leave |
| Health_good_percent | The percentage of people with good health in the area |
| Earnings_Median | The median earnings in the area |
| age_mean | The mean age in the area |
| Birth_UK_percent | The percentage of UK-born people in the area |

Note that these variables come in various measures: for example, age and earnings are coded as their mean and median, respectively, while the Remain and Leave votes are percentages. Be careful about which variable you call, because it influences how you interpret the results!

Let’s explore the dataset in a few simple commands that we have learned in previous labs:

str(brexit_data) #display structure of the dataset
dim(brexit_data) #the number of observations and variables in the data
hist(brexit_data$Percent_Turnout) #how do you interpret the histogram?
hist(brexit_data$Percent_Leave)
hist(brexit_data$age_mean)

How do you interpret these histograms?

Let’s have a closer look at the Leave-vote:

summary(brexit_data$Percent_Leave)
boxplot(brexit_data$Percent_Leave) #How does the boxplot relate to what you see above?
densityPlot(brexit_data$Percent_Leave) #from the car package

Recalling the commands from the previous script, we can also look at the distribution of the Leave-vote for two groups, individuals who live in an area where the mean age is over 40 and those who live in an area where the mean age is below 40.

under40 <- subset(brexit_data, brexit_data$age_mean<40)
over40 <- subset(brexit_data, brexit_data$age_mean>40)

par(mfrow=c(1,2))
#"par" sets graphical parameters; "mfrow" arranges subsequent figures in an nr-by-nc array. In this case, a 1 row by 2 columns layout.

hist(under40$Percent_Leave) 
# hist(under40$Percent_Leave, ylim=c(0,50)) ***Aligns Y Axis***
hist(over40$Percent_Leave)

par(mfrow=c(1,1))
#here, we go back to a 1 row by 1 column array
hist(under40$Percent_Leave)
hist(over40$Percent_Leave)

Now that you know how the Leave vote is distributed across the dataset, what would be your best guess if you had to provide one likely value for a Leave percentage in any given constituency?

Remember that we know that the Leave percentage varies across constituencies. There will always be errors, but we want to minimize the SSE - the sum of squared errors - from our fitted straight line from above. In principle, the single value that is likely to be closest to the Leave vote share in all constituencies and thus minimizes the sum of squared errors (provided we don’t have any other information) is the mean!

But, in reality, there is meaningful variation among units and some other factors might be useful in explaining the variation in the Leave vote. Would the mean age of a constituency explain the level of voting Leave? Health standards of the people living there? Their income level? The first step in regression analysis is to think of a plausible relationship between two variables and then unravel this relationship.


Bivariate Relationships

To understand overall relationships, scatter plots can be more effective than simply comparing histograms or maps.

Let’s plot a well-known correlation that came into play during the Brexit referendum: age and electoral outcomes. Do you see a relationship in the graph? We can add lines for the means of age and Leave vote share to see how the observations are distributed with respect to these values.

par(mfrow=c(1,2))
plot(brexit_data$age_mean, brexit_data$Percent_Leave)

plot(brexit_data$age_mean, brexit_data$Percent_Leave)

abline(v=mean(brexit_data$age_mean, na.rm=TRUE), col = "red")
abline(h=mean(brexit_data$Percent_Leave[!is.na(brexit_data$age_mean)], na.rm=TRUE),
       # we do not want to include missing values
       col = "blue")
#"[!is.na(age_mean)], na.rm=T": tells R not to use missing values in age.
#missing values would keep R from calculating the mean and its relationship with the Leave variable.
par(mfrow=c(1,1))

Regression Analysis

The Logic of Regressions

Recall that the regression coefficient is calculated based on the covariance of \(X\) and \(Y\). If the values of the observations in a sample differ a lot, the variance is high; if the values are relatively similar, the variance is low.

For instance, take the list of numbers 2, 9, 67. We calculate the variance by summing the squared difference between each number and the mean, and divide by the number of observations (minus one if the numbers are a sample rather than a population):

\[ \sigma^2=\frac{\sum_{i=1}^n\left(x_i-\bar{x}\right)^2}{n-1} \]

numbers <- c(2,9,67)
#to create a vector containing the numbers 2, 9 and 67
numbers

num_mean <- mean(numbers)
#to compute the mean value for all the observations in the vector numbers
num_mean

num_var_sample <- 1/(3-1) * ((2-num_mean)^2 + (9-num_mean)^2 + (67-num_mean)^2)
#to compute the sample variance
num_var_sample

num_var_pop <- 1/(3) * ((2-num_mean)^2 + (9-num_mean)^2 + (67-num_mean)^2)
#to compute the population variance
num_var_pop

# More simply:
var(numbers) # note that var() uses the sample formula, dividing by n-1

The variance is high (849 if the numbers represent the whole population, 1273 if it concerns a sample).

Recall that the variance represents the average squared distance between observations and the mean. The standard deviation, accordingly, is the square root of the variance: roughly 36 for our sample.

sd(numbers)

To summarise the relationship between two variables, we can further calculate the correlation. Correlations are a standardised and non-unit-specific way of measuring the relationship between two variables. They represent the degree to which one variable is linearly associated with another, but without revealing the magnitude of that association.
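Under the hood, the correlation is simply the covariance rescaled by the two standard deviations, which is what makes it unit-free. A quick sketch with toy numbers (the second vector is made up for illustration):

```r
x <- c(2, 9, 67)            # the toy numbers from above
y <- c(1, 4, 30)            # a second, made-up vector
cov(x, y) / (sd(x) * sd(y)) # correlation computed by hand
cor(x, y)                   # same value
```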

cor(brexit_data$age_mean, brexit_data$Percent_Leave, use = "complete")
#the correlation is 0.38 on a scale from -1 to 1.

Since the correlation is positive, we know that as mean age increases so, too, does the percentage of Leave voters in the regional unit.

But how much does a one year increase in age add to the Leave vote in general terms? Let’s use regression to find out!

In linear regression analysis, we estimate the coefficient \(\hat{\beta}\) by dividing the covariance of \(X\) and \(Y\) (the extent to which \(X\) and \(Y\) move together) by the variance of \(X\) (how much variation there is in \(X\), simply speaking). In other words: To what extent do the different values that \(X\) takes explain the relationship between \(X\) and \(Y\)?

\[ \hat{\beta}=\frac{\operatorname{COV}(X, Y)}{\operatorname{VAR}(X)} \]

To calculate the coefficient manually, we start by calculating the covariance of age and Leave voting:

age_leave_cov <- cov(brexit_data$Percent_Leave, brexit_data$age_mean, use = "complete")
#note the extra "use" command at the end.
#missing observations should be ignored, otherwise the command won't run.
age_leave_cov

Next we can calculate the variance in our \(X\)-variable, i.e. age:

age_var <- var(brexit_data$age_mean, use = "complete")
age_var

Finally, we can divide the covariance of \(X\) and \(Y\) by the variance of \(X\) to estimate the regression coefficient \(\beta\):

beta <- age_leave_cov/age_var
beta

We see that - based on our linear estimation - a one year increase in the mean age of the electoral district is associated with an increase in the Leave vote share of the area by 1.288 %-points on average.


Linear Regression in R

Doing this calculation manually, however, is time-consuming. Fortunately, R can easily calculate the regression coefficients for us (including the intercept value, too, which is automatically added to the model):

mod1 <- lm(brexit_data$Percent_Leave ~ brexit_data$age_mean)
#note that the results are stored in the working environment. Using the output of the lm command is not the most efficient way to interpret the regression - feel free to try and compare it to the code below.

#Usually, we print the output using the summary command:
summary(mod1)

You can see that the coefficient value for age_mean is the same as the manual value (\(\hat{\beta}\)) we computed above.

Next, we might wonder how much of the variation in \(Y\) is explained by the variation in \(X\). We can find out by asking for the R-squared, which broadly indicates what portion of the variation is being explained by the regression model:

summary(mod1)$r.squared

We get an R-squared of 0.1463, which means that about 15 percent of the variation in the Leave vote is explained by age. Regression tells us what accounts for the variation in the outcome when the mean (in this case the mean Leave vote) is not the best available explanation for a value of \(Y\). This means that our model is “this much better” than using the mean. Here, we conclude that age does account for a meaningful part of the observed variation: constituencies with a higher mean age are more likely to vote Leave. So, we get a better prediction of how a constituency votes by knowing the age of the voters than just relying on the mean Leave vote.
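If you want to see where this number comes from, the R-squared can be recovered by hand from the model's residuals, which makes the "better than the mean" interpretation explicit:

```r
# R-squared = 1 - SSE/TSS, comparing the model's errors to the mean's errors
y <- mod1$model[[1]]          # the Leave shares actually used in the model
sse <- sum(resid(mod1)^2)     # variation left unexplained by the model
tss <- sum((y - mean(y))^2)   # total variation around the mean
1 - sse/tss                   # matches summary(mod1)$r.squared
```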

Moreover, when we fit our estimated linear regression line, we can predict the Leave vote share for mean ages which are not included in our data.

plot(brexit_data$Percent_Leave ~ brexit_data$age_mean, xlim =c(30,50),
ylim =c(3,80), ylab = "Leave vote share", xlab = "Mean age", pch=16, col="blue")
abline(mod1)

# Recall why we specify the arguments in the command above: We want to create a figure that is as insightful, clear and parsimonious as possible!
#xlim = what are the limits of the x-axis? Mean ages are between 30 and 50.
#we can find this out by
min(brexit_data$age_mean, na.rm=TRUE)
#alternative code:min(brexit_data$age_mean[!is.na(brexit_data$age_mean)])
max(brexit_data$age_mean, na.rm=TRUE)
#alternative code:max(brexit_data$age_mean[!is.na(brexit_data$age_mean)])
#ylim = what are the limits of the y-axis?
min(brexit_data$Percent_Leave)
max(brexit_data$Percent_Leave) #no missing values here
#Leave vote share between 4 and 76
# xlab and ylab for naming the y and x axes
# pch and col just to make the dots prettier!
#if you just want to get a simple graph as a first idea, just run
plot(brexit_data$Percent_Leave ~ brexit_data$age_mean)
abline(mod1)

At the constituency mean age of 40.23808, we predict a 54 percent vote share for Leave. You can also calculate this manually without the graph by using the equation

\[\hat{Y}=\hat{\alpha}+\hat{\beta} X\] Based on our simple bivariate model, our estimated value of the Leave vote will be

\[2.611 + (1.288*age_{mean}) = 2.611 + (40.23808*1.288) = 54.44 \] which is very close to what we got by looking at the graph.

Manual calculation:

#estimated % Leave share = 2.611 + 1.288 * mean age
mean(brexit_data$age_mean,na.rm = TRUE)
beta # as we calculated before

# Note that you can call the estimated intercept and regression slope from the OLS model we saved earlier
mod1$coefficients[1]+mod1$coefficients[2]*mean(brexit_data$age_mean,na.rm = TRUE)
#estimated share at another value of age:
mod1$coefficients[1]+mod1$coefficients[2]*45

# Double check
2.611+1.288*40.23808

We now turn to a different question: Did immigration levels correlate with the Leave vote - and, if so, can we predict the Leave vote share based on the level of immigration in a given constituency?

Let’s try to answer this by regressing the Leave vote share on the percentage of UK-born residents in each constituency.

UKborn <- lm(brexit_data$Percent_Leave ~ brexit_data$Birth_UK_percent)
summary(UKborn) #to print out the result

We get a coefficient of 0.58210 for Birth_UK_percent. This means that a one percentage-point increase in the proportion of UK-born individuals is associated with an average 0.58 percentage-point rise in the Leave vote share. The more UK-born residents live in the electoral district, the higher the Leave vote!

Let’s also look at the correlation between these two variables:

cor(brexit_data$Percent_Leave, brexit_data$Birth_UK_percent, use="complete")

How do you interpret the regression coefficient? Is it a better predictor than age? Let’s plot this and include our line of best fit.

plot(brexit_data$Percent_Leave ~ brexit_data$Birth_UK_percent, pch=16, col="blue")
abline(UKborn)

Now, let’s try to look at this same relationship by creating a variable for the share of foreign-born residents in each constituency, before regressing the Leave vote on this newly created variable. [Can you think of another way to create this variable?]

brexit_data$foreignborn_percentage <- brexit_data$Birth_other_EU_percent +
brexit_data$Birth_Africa_percent +
brexit_data$Birth_MidEast_Asia_percent +
brexit_data$Birth_Americas_Carrib_percent +
brexit_data$Birth_Antarctica_Oceania_Other_percent
#foreignborn_percentage is the new variable
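One alternative way to build this variable - assuming the birth-place percentages in the dataset are exhaustive and sum to (roughly) 100 in each area - is to subtract the UK-born share from 100:

```r
# hypothetical shortcut: everyone not born in the UK is foreign-born
brexit_data$foreignborn_alt <- 100 - brexit_data$Birth_UK_percent
# if the categories are exhaustive, the two versions should be nearly identical:
summary(brexit_data$foreignborn_alt - brexit_data$foreignborn_percentage)
```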

cor(brexit_data$foreignborn_percentage,
    brexit_data$Percent_Leave, use="complete") #this time negative

foreigners <- lm(brexit_data$Percent_Leave ~ brexit_data$foreignborn_percentage)
summary(foreigners)
#foreigners stores the results of this regression, to view just type "foreigners"

plot(brexit_data$Percent_Leave ~ brexit_data$foreignborn_percentage, pch=16, col="blue")
abline(foreigners)

A concern we may have at this point is whether our results are possibly driven by an outlier. We can get a better idea about what is influencing our estimates by asking R to label our observations.

plot(brexit_data$Percent_Leave ~ brexit_data$Birth_UK_percent, type="n")
text(brexit_data$Percent_Leave ~ brexit_data$Birth_UK_percent, labels=brexit_data$Region, cex=0.5)

The labels overlap heavily, but we can see that areas on the lower end of both the Leave vote share and the share of UK-born residents are located in London. Let’s subset the data to exclude London and see if our estimates change.

#how many districts are in the London region?
length(brexit_data$Region[brexit_data$Region=="London"])

#we subset the data to exclude the 33 areas in London
withoutLondon <- subset(brexit_data, Region != "London")
dim(withoutLondon)
#The new number of observations is 349.
#This comes from subtracting London's 33 areas from the original total of 382.

#and now we re-run the regression
immigWL <- lm(withoutLondon$Percent_Leave ~ withoutLondon$Birth_UK_percent)
summary(immigWL) #what happened to our coefficient?

What do you make of this estimate? How does it relate to your previous model? Provide a substantive interpretation.

Linear regression models often provide the most important insights in data analysis - it’s not by chance that they have become the gold standard of social science analysis. Let’s present our regression result in a nice way - we can use the stargazer package to create a decent-looking regression table.

install.packages("stargazer") #note the quotation marks
library(stargazer)

# For example (yvar and xvar are placeholders - replace them with real variables):
fit_example <- lm(yvar ~ xvar)
stargazer(fit_example, type="text")

stargazer(mod1,immigWL, type="text")

# Fully annotated/presentable version

stargazer(mod1,immigWL, 
          column.labels=c("Leave (%)"), 
          column.separate = c(2),   # telling R to use the label for the first two columns
          dep.var.labels.include = FALSE, 
          covariate.labels=c("Age (Mean)", "Born in UK (%)"), 
          type="text")

Alternatively, you can set type="latex" to make stargazer produce LaTeX code. You can then paste this code into whatever programme you use for LaTeX on your machine, or into Overleaf - the latter comes in handy for producing professional-looking tables even if you don’t usually use LaTeX.


Exercise

Immigration Attitudes: The Role of Economic and Cultural Threat

Why do the majority of voters in the U.S. and other developed countries oppose increasing immigration? According to the conventional view and many economic theories, people simply do not want to face additional competition in the labor market (the economic threat hypothesis). Nonetheless, most comprehensive empirical tests have failed to confirm this hypothesis and it appears that people actually often support policies that are not in their personal economic interest. At the same time, there has been growing evidence that immigration attitudes are rather influenced by various deep-rooted ethnic and cultural stereotypes (cultural threat hypothesis). Given the prominence of workers’ economic concerns in the political discourse, how can these findings be reconciled?

This exercise is based in part on Malhotra, N., Margalit, Y. and Mo, C.H., 2013. “Economic Explanations for Opposition to Immigration: Distinguishing between Prevalence and Conditional Impact.” American Journal of Political Science, Vol. 38, No. 3, pp. 393-433.

The authors argue that, while job competition is not a prevalent threat and therefore may not be detected by aggregating survey responses, its conditional impact in selected industries may be quite sizable. To test their hypothesis, they conduct a unique survey of Americans’ attitudes toward H-1B visas. The plurality of H-1B visas are occupied by Indian immigrants, who are skilled but ethnically distinct, which enables the authors to measure a specific skill set (high technology) that is threatened by a particular type of immigrant (H-1B visa holders). The data set immig.csv has the following variables:

| Variable Name | Variable Description |
|---------------|----------------------|
| age | Age (in years) |
| female | Dummy: 1 indicates female; 0 indicates male |
| employed | Dummy: 1 indicates employed; 0 indicates unemployed |
| nontech.whitcol | Dummy: 1 indicates non-tech white-collar work (e.g. law), 0 everything else |
| tech.whitcol | Dummy: 1 indicates high-technology work, 0 everything else |
| expl.prejud | Explicit negative stereotypes about Indians (continuous scale, 0-1) |
| impl.prejud | Implicit bias against Indian Americans (continuous scale, 0-1) |
| h1bvis.supp | Support for increasing H-1B visas (5-point scale, 0-1) |
| indimm.supp | Support for increasing Indian immigration (5-point scale, 0-1) |

The main outcome of interest (h1bvis.supp) was measured as a following survey item: “Some people have proposed that the U.S. government increase the number of H-1B visas, which are allowances for U.S. companies to hire workers from foreign countries to work in highly skilled occupations (such as engineering, computer programming, and high-technology). Do you think the U.S. should increase, decrease, or keep about the same number of H-1B visas?” A higher value reflects greater support for increasing H-1B visas.

Another outcome (indimm.supp) similarly asked about the “the number of immigrants from India.” Both variables have the following response options: 0 = “decrease a great deal”, 0.25 = “decrease a little”, 0.5 = “keep about the same”, 0.75 = “increase a little”, 1 = “increase a great deal”.

To measure explicit stereotypes (expl.prejud), respondents were asked to evaluate Indians on a series of traits: capable, polite, hardworking, hygienic, and trustworthy. All responses were then used to create a scale ranging from 0 (only positive perceptions of traits of Indians) to 1 (no positive perceptions of traits of Indians). Implicit bias (impl.prejud) is measured via the Implicit Association Test (IAT) which is an experimental method designed to gauge the strength of associations linking social categories (e.g., European vs Indian American) to evaluative anchors (e.g., good vs bad). Individuals who are prejudiced against Indians should be quicker at making classifications of faces and words when European American (Indian American) is paired with good (bad) than when European American (Indian American) is paired with bad (good). If you want, you can test yourself here.

Task 1

Start by examining the distribution of immigration attitudes. What is the proportion of people who are willing to increase the quota for high-skilled foreign professionals (h1bvis.supp) or support immigration from India (indimm.supp)?

Now compare the distribution of two distinct measures of cultural threat: explicit stereotyping about Indians (expl.prejud) and implicit bias against Indian Americans (impl.prejud). Create a scatter plot, add a linear regression line to it, and calculate the correlation coefficient. Based on these results, what can you say about their relationship?

Task 2

To investigate the impact of cultural threat on policy attitudes, regress H-1B and Indian immigration attitudes on explicit and implicit prejudice (expl.prejud and impl.prejud), respectively, so that you have four different models in total. For each of the models, create a scatter plot and add the linear regression line to it. Do you agree that cultural threat is an important predictor of immigration attitudes, as claimed in the literature?

If the labor market hypothesis is correct, opposition to H-1B visas should also be more pronounced among those who are economically threatened by this policy, such as individuals in the high-technology sector. At the same time, tech workers should not be more or less opposed to general Indian immigration because of any specific economic considerations. Regress both measures of immigration attitudes on the indicator variable for tech workers (tech.whitcol). Simply speaking, is there a relationship between the dependent variable (immigration attitudes) and our explanatory variable (tech workers)? How would you interpret the coefficients for tech.whitcol? Do the results support the hypothesis? Is the relationship different from the one involving cultural threat and, if so, how? What does the R-squared tell us about model fit?